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The manual process of searching out individuals in an already existing 
research field is cumbersome and time-consuming. Prominent and rookie 
researchers alike are predisposed to seek existing research publications in 
a research field of interest before coming up with a thesis. From 
extant literature, automated similar research area detection systems have 
been developed to solve this problem. However, most of them use 
keyword-matching techniques, which do not sufficiently capture the implicit 
semantics of keywords thereby leaving out some research articles. In this 
study, we propose the use of ontology-based pre-processing, Latent Semantic 
Indexing and K-Means Clustering to develop a prototype similar research area 
detection system, that can be used to determine similar research domain 
publications. Our proposed system solves the challenge of high dimensionality 
and data sparsity faced by the traditional document clustering technique. Our 
system is evaluated with randomly selected publications from faculties 


in Nigerian universities and results show that the integration of ontologies 
in preprocessing provides more accurate clustering results. 
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1. INTRODUCTION 
Background of study 

For the purpose of research, topics and relevant articles from digital libraries and online databases are 
sought for in order to gain a better understanding. These sources are useful in the retrieval of relevant articles 
by renowned researchers [1]. Citations (a document cites another), bibliographic coupling (documents sharing 
a reference in their bibliography) and co-word linkages (documents share certain words) are some of 
the existing methods that have been used to identify textual documents that are necessary for research. 
Traditional document clustering approaches do not sufficiently capture semantic relations between keywords 
leading to ambiguity and high dimensionality thereby reducing the accuracy of clustering results [2, 3]. 
Traditional clustering methods often ignore semantic relationships or connections between words and there 
produce inaccurate representations of such articles. Text data carry high level semantic information and diverse 
vocabulary, hence the need for text clustering techniques to improve quality. 
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When a prospective researcher embarks on a research endeavor, the researcher must conduct a review 
of related works already documented in literature in that field for a robust research publication. The researcher 
would This study utilizes a dataset containing bibliographic information of Nigerian researchers as a case study 
and applies co-word linkages to determine documents in the same field. The goal of this study is to develop 
a semantics-based clustering framework for detecting similar research areas. The system determines 
the suitability of an author to a research field based on the semantic similarity between an author’s work and 
previous works. This would be particularly useful for researchers who intend to locate researchers within their 
research field and similar research field. The detection of similar research areas based on keywords could prove 
beneficial to tertiary institutions of learning and research center. The research networks created is intended to 
enhance the prospect of research collaborations in the continent. According to [4-6], the existing methods that 
have been used to identify textual documents that address a particular subject matter include: 

- Citations (a document cites another) 

- Bibliographic coupling (documents sharing a reference in their bibliography) 
- Co-word linkages (documents share certain words) 

The problem 

Traditional document clustering approaches do not sufficiently capture semantic relations between 
keywords leading to ambiguity and high dimensionality thereby reducing the accuracy of clustering 
results [7]. Existing keyword matching techniques can be significantly improved by integrating semantics in 
document similarity computation. The detection of similar research areas based on keywords could prove 
beneficial to tertiary institutions of learning and research centres. This study intends to utilize a dataset 
containing bibliographic information of Nigerian researchers as a case study. We expect the research networks 
created to enhance the prospect of research collaborations in the continent 
The proposed solution 

In this study, an automated similar research area detection system is proposed that generates similar 
research areas and publications to that of a prospective researcher. Based on the expertise of the prospective 
researcher, the system automatically assigns prospective researchers to already existing researchers in the same 
or similar research field. This is done using the similarity score obtained using the LSA model and 
cosine similarity to calculate the semantic similarity between them. The proposed system is built for researchers 
to detect similar research areas depending on their research preferences. The system is accessible as 
a web application. 

Therefore, the aim of the research was to develop a semantics-based clustering method for detecting 
similar research areas using Nigerian publications as a case study and to achieve this aim, the following 
objectives were carried out; 

- Creation of a dataset 

- Develop a framework for similar research area detection 
- Implement a prototype for the proposed framework 

- Validation and evaluation of the proposed approach 


2. LITERATURE REVIEW 

A review of existing semantic clustering techniques already document in literature is outlined 
below [8] presented a deep hypergraph model for sentiment classification and online reviews. The model used 
a hierarchical clustering algorithm to discover semantic cliques. The model, tested with movie reviews and 
product reviews (books, DVD, electronic and kitchen) was compared with seven other methods of sentiment 
classification and results showed the model outperformed all other methods in all cases. Also, [9] used semantic 
clustering to locate and access web documents. The text corpus is pre-processed, stemming is performed using 
the WordNet ontology. The term frequency-inverse document frequency algorithm was used to construct 
a feature matrix. Hierarchical agglomerative clustering was used to perform clustering on the feature matrix. 
The approach used, improved the accuracy of the clusters generated. The drawback is that Hierarchical 
agglomerative clustering is not suitable for large datasets. Similarly, [10] performed enhanced semantic 
clustering with the WordNet ontology. The text corpus was pre-processed with WordNet ontology to perform 
word sense disambiguation. The term frequency-inverse document frequency technique is used to derive 
a feature representation of the words in the text corpus. The K-Means clustering algorithm is applied to cluster 
the feature vectors [11]. The pre-processing method used eliminated the dimensionality problem encountered 
in traditional document clustering. The limitation is that the K-Means clustering algorithm suffers from 
the local optima problem. In [12] as well used Semantic clustering to solve the topic drift problem 
in information retrieval systems. Search snippets are preprocessed and extract the longest common 
subsequence between two snippets by GST. Evaluate Word similarity using HowNet ontology and construct a 
lexical chain to select features of snippets. A feature vector is constructed and evaluate snippet similarities. 
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The Improved Chameleon algorithm used for clustering the feature vectors improved the within-class density 
and between-class variation of the cluster labels. However, the Chameleon algorithm has a high time and space 
complexity which does not make it suitable for high dimensional datasets. 

In [13] used semantic clustering to cluster search result documents based on the semantics of retrieved 
documents. A search engine was queried to retrieve information, the search engine results were then 
pre-processed, and extraction of features was carried out. The features were enhanced using concepts from 
an ontology and semantic network. A dissimilarity matrix of the documents was created using 
the Floyd-Warshall algorithm [14]. The Hierarchical agglomerative clustering algorithm is used to cluster 
the feature vectors in the similarity matrix. High precision results were obtained as the approach outperformed 
existing approaches for web search result clustering. Hierarchical agglomerative clustering can be 
computationally expensive to use for large datasets. In [15] also proposed the use of semantic clustering to 
determine similar text documents. The text corpus is extracted and pre-processed. The term frequency-inverse 
document frequency algorithm is used to identify frequently occurring terms and construct a document matrix. 
A domain ontology is constructed from the text corpus to provide a vocabulary for filtering relevant terms. 
A Fuzzy equivalence relation is used to determine the level of membership of terms in the text corpus. 
Singular value decomposition is used to transform the document matrix into a concept space. Bisecting 
K-means algorithm is used to perform clustering of the concept space. The use of a domain ontology in 
the pre-processing stage improves clustering results. The limitation of the method is the performance of 
the method is entirely dependent on the quality and comprehensiveness of the ontology used. In [16] as well 
used semantic clustering to classify customer reviews. The text corpus is extracted by crawling customer review 
websites and then it is pre-processed. Ontology is used to generate a concept mapping in the text corpus. 
Euclidean distance metrics are used to calculate the similarity of sentences in the bag of words vector space 
model. The modified K-Means algorithm is used to cluster the bag of words. Experimental results revealed that 
the accuracy of the clusters generated is increased with the use of ontology in the pre-processing stage. 

Further methods for identifying existing collaborations between various researchers from various 
publication databases are presented below [17] developed a co-authorship network to reveal the interactions 
between researchers. A system for selecting a collaborator with similar research interests for joint research was 
modeled as a link prediction problem. Authors with similar known features were determined using Cosine 
similarity computed on a vector constructed to model the descriptive statistics of various research activities. 
The co-authorship network was determined using the hierarchical clustering of research interests in various 
co-occurrence networks. They also used logistic regression with lasso regularization on normalized feature 
vectors. The disadvantage of this approach is that information retrieval was used to obtain data from 
the bibliography database and the semantic meaning of the terms was not taken into consideration. 

A novel architecture was proposed by [18] for joining multiple bibliographic sources to identify 
common research areas and relationships between authors and their publications. The scientific publications 
were retrieved from various bibliographic sources using APIs and Linked data practices, the data is analyzed 
to provide a structure and if there is no explicit structure, the data model is produced using web scraping. 
An ontology mapping model is used to unify the data from different bibliographic sources, and data 
disambiguation is performed to eliminate data inconsistencies and duplications. A vector space model of 
the retrieved information is generated using the TF-IDF algorithm. The K-means clustering algorithm using 
the Cosine Similarity measure was used to automatically discover similarities and group the authors into their 
research areas. The information retrieval method used did not utilize semantics in retrieving 
the information. In [19] proposed a method for determining collaborations between university research and 
industry research. A heterogeneous social network [20] was constructed to describe the relationship between 
researchers and companies, a company and a researcher are deemed to have a relationship if they have 
co-authored academic articles, co-participated in projects or co-invented patents. A dataset [21, 22] is created 
for researchers who have directly collaborated with companies before, it is assumed that researchers within 
the same domain as the researchers in this dataset can collaborate with them. Company similarity is also used 
to determine potential collaborations, companies that have worked on similar patents, articles, and projects. 
Keywords are extracted from company technological documents, pre-processing is then performed using 
tokenization, stop words removal, normalization and stemming, the vector space model is used to index 
the extracted keyword frequencies. The cosine similarity measure is used to determine the similarity of 
the different companies using their keywords, the top K most similar companies are stored as neighbor 
companies and researchers with connections to a neighboring company are potential collaborators of its 
neighboring companies. The approach used by [23] to predict potential research collaborations involves 
the use of the online social network [20] to determine the co-authorship network. The latent dirichlet allocation 
(LDA) algorithm is used to model a set of topics from a document corpus consisting of authored papers, these 
are then represented in a K-dimensional vector. LDA is also used to determine the content similarity of 
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the papers. The weighted support vector machine was used to perform link prediction to determine potential 
collaborators. The drawback of this method is that the use of LDA makes it computationally expensive. 


2.1. Document clustering 

Document clustering is the task of grouping a set of text documents into groups or clusters. Documents 
belonging to the same cluster share the same features while documents in another cluster do not share similar 
features. Traditional document clustering techniques use a bag of words representation; which do not take 
semantics into consideration. In [2] described the typical process of document clustering in Figure 1. 


Apply pre-processin; Keyword 
f mma fo Apply prep 7 Words Generation 


Construct 
the 
feature space 



















Cm) Cluster to the clusters Value of 
(sn = = 


Figure 1. Document clustering process [2] 
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2.2. Feature representation 
2.2.1. Vector space model 

The vector space model (VSM) is largely considered the basic model for feature representation and 
has been modified severely to cater for its inadequacies. In the generic VSM model, each text document is 
represented as follows: 


g:d+> od) = [Wed Wez d, wen al E NP (1) 


The term frequency-inverse document frequency technique has been widely used for feature representation of 
text documents, where is the TF-IDF weight of term t in document d. T denotes the transpose operator, denotes 
the document d as a weighted term vector in the m-dimensional space of terms. This function could also 
represent the mapping of a document to its vector space representation. The documents are then weighted by 
their inverse document frequency (IDF). This weighting is done to determine terms that appear frequently 
across the set of text documents. The TF-IDF matrix is represented thus: 


(tf — idf)ij = tfij x idfi (2) 


The drawback of the VSM model is that it suffers from the sparsity problem and does not perform well with 
large documents [24]. 


2.2.2. N-gram model 

The N-gram model traditionally focuses on bi-grams, which are pairs of words but recently, the use 
of character N-grams and byte N-grams is commonplace. Character N-gram is a language autonomous text 
representation method. Text documents are transformed into high-dimensional feature vectors where features 
represent substrings. N-grams are typically N adjacent characters from the alphabet. The dimensionality of 
N-gram features can be as high as IAIN even for mid-range values of N. Generally, only a sizeable portion of 
N-grams are available in a given set of text documents. The N-gram model has the further advantage of being 
robust and tolerant of grammatical and typographical errors [25]. The limitation of the n-gram model is that 
the semantics of words and word order is not taken into consideration. 


2.2.3. Latent semantic indexing 

Latent semantic indexing is an algebraic based algorithm that is used for feature representation. 
It works based on a primary or latent structure to the word pattern usage in a text document and utilizes 
statistical techniques in determining this structure. It considers latent higher-order structures in the relationship 
between terms and documents. This technique can be applied to synonymy and polysemy problems. Latent 
semantic indexing is also used for dimensionality reduction using singular value decomposition (SVD). 
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The drawback of LSI is that it uses a bag-of-words approach which can lead to unstructured information and it 
only works on singular terms [26]. 


2.2.4. Probabilistic latent semantic analysis (PLSA) 

Probabilistic LSA is a statistical technique for co-occurrence data. PLSA is derived from LSA by 
making it a probabilistic model. Probabilistic LSA is based on combination decomposition derived from 
a latent class model, unlike the standard latent semantic analysis which is from linear algebra. Documents are 
modeled as a multinomial combination of topics which gives room for document-document comparison. 
This makes PLSA a more popular technique for analysis of co-occurrence data. PLSA is used to model, 
the probability of each co-occurrence as a combination of independent multinomial distributions. 


P(w, d) = Xc P(c)P(d|c)P(wic) = P(d) Xe P(cld)P(wic) (3) 


P(w,d) is the symmetric formulation where w and d are computed from the latent class in similar ways using 
the conditional probabilities P(dlc) and P(wlc), for each document [27]. 


3. MATERIAL AND METHODS 

The 2010-2018 Publication data from randomly selected Nigerian institutions was retrieved through 
the Scopus API listed in the Scopus database. Each publication retrieved with a unique Elsevier ID is used to 
retrieve its abstract which builds a database of 9800 publication data. The abstracts are concatenated and stored 
in a database and the file format of the abstracts retrieved is in JavaScript Object Notation (JSON) file format. 


3.1. Data preprocessing 
The datasets were obtained in their raw form and required text processing and formatting to make 

them intelligible. Several operations were carried out on the datasets to extract the required text are 

discussed below: 

- Non-printable characters: Regular expressions were used to remove characters that did not conform to 
the Unicode text encoding format. 

-  Tokenization: The natural language toolkit (NLTK) class was used to perform tokenization and 
the conversion of each word to lowercase characters. 

- | Number removal: The built-in python module was used to eliminate numbers but not words representing 
numbers. 

- | Stop words removal: Using the NLTK toolkit, stop words list was used to remove frequently occurring 
English words such as conjunctions and prepositions which do not reflect the content of the text corpus. 

- Lemmatization: The NLTK wordnet and the WordNet Lemmatizer was used to obtain the stem versions 
of each word in the text corpus. 


4. THE PROPOSED SYSTEM 

The main components include dataset collation, pre-processing & database, document representation 
module, and the Pattern detection module. The system architecture is laid out in the Figure 2. The proposed 
framework consists of the presentation layer, business logic layer and the data layer, all having their roles in 
the total functionality of the system. Figure 3 presents the three-tier showing the different layers and their 
interactions. 
- Presentation layer 

From this layer, users can input a query and receive a response to their query. This layer cannot carry 
out computations on its own, but it interacts with the business logic layer through the Django web framework 
to provide more functionalities. 
- Business logic layer 

This layer consists of the Python application, which provides the functionalities to the presentation 
layer. It also interacts with the data layer through python SQLite connector to process necessary data useful for 
the working of the system. The Gensim library is a python library that is used for document representation 
implementation while the natural language toolkit handles natural language computations. The DBpedia and 
WordNet ontology is used for semantically annotating documents. 
- Data layer 

This is the layer where all the information needs of the system are stored. SQLite is used as 
the database management system platform for storing and managing records of individual reviewers. The data 
layer communicates with the business logic layer through python SQLite connector. 
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Figure 2. Proposed framework 
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4.1. Activity diagram 


Figure 3. Three-tier architecture 


In Figure 4, the flow of activities in similar research area detection is shown. The diagram displays 
the flow of activities between the user and the system. The flow of activities includes: registration, login, 
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uploading an existing publication, the system recommends similar publications, the user checks the research 
area, the result is displayed, if the user is not pleased with the result, the user will be redirected to the similar 
research area page to adjust the uploaded publication. 
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Upload 
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similar 
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Figure 4. Activity diagram 


4.2. Sequence diagram 

Sequence diagrams model the dynamic behaviour of a system; they are interactive diagrams that 
depict the passing of messages between objects in a system. The messages are passed from the user to 
the registration module, and from the registration module to the database and back to the user to display 
a successful registration. Figure 5 shows the sequence diagram of the similar research publication discovery 
process and how messages are exchanged between the different modules from the user to the publication 
repository and back to the user with the result of the message passed across. 
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Figure 5. Sequence diagram for similar publication discovery 


5. RESULTS AND DISCUSSION 
5.1. Experimental results 

The elbow curve was used to validate the optimal number of clusters beforehand, we used a range of 
values of K and selected the optimal value based on the elbow curve of the clusters. The evaluation of 
the clusters generated was performed using Silhouette analysis, which revealed a high level of accuracy and 
consistency as evidenced by an average 80% silhouette score for all the data points as shown in Figure 6. 
The figure shows that the number of clusters increases with the scores. 
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Figure 6. Evaluation results 


6. CONCLUSION 

In this work, a prototype system was developed to provide a platform for enhancing collaborations 
between researchers in a research field by enabling a researcher to identify other researchers in a given research 
field. A rigorous review of the literature was conducted to examine existing approaches that have been utilized 
in discovering similar research areas, this study utilized an approach involving the use of publication documents 
sharing certain keywords. Randomly selected Nigerian University publication documents from 2010 to 2018 
were retrieved and used as a case study. This approach was modeled as a clustering problem. In order to 
improve the accuracy of the clustering results, the DBpedia and WordNet ontology was used to capture domain 
terms and semantically related terms in the publication dataset used. LSI and TF-IDF were used to model 
the text documents and generate feature vectors. The K-Means clustering algorithm was used to cluster 
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the feature vectors. The clustering results were evaluated using the silhouette analysis technique, which 
revealed a high intra-cluster similarity of 0.8 on the average across all the data points. A prototype system was 
developed using Python programming language, while SQLite Database Management System was used to 
manage the database. Our results show that the prototype system showed high clustering accuracy and could 
be deployed for large scale utilization. 


7. CONTRIBUTION TO KNOWLEDGE 

This work contributes to the existing body of knowledge by developing a prototype system that 
integrates ontology-based pre-processing, Latent Semantic Indexing and K-Means clustering to discover 
similar research domain publications. The system determines the similarity between publication documents 
using semantic similarity techniques. The clustering results show an improvement through the use of ontology 
semantics in pre-processing the documents. It is also believed that this approach will be useful for unearthing 
hidden and implicit patterns in the document dataset. 


8. RECOMMENDATIONS AND FUTURE WORK 

For the system to be continuously relevant, the publication repository will have to undergo regular 
updates to improve the robustness of the system. Hence, for future improvements in this research, the following 
recommendations are proposed: 

- The system can be integrated with locally available conference and journal publication repositories to 
provide similar research area detection services. 

- To improve the pre-processing stage, an ontology learning model can be incorporated into the concept 
of weighting and concept-document construction stage to produce domain ontologies pertinent to 
the publication text corpus. 

- Other document clustering algorithms could be utilized, such as Bisecting K-means and 
K-medoids algorithm. 

- The use of some other document representation techniques such as Word2Vec or Recurrent Neural 
Network Language can be used to improve the semantic similarity computation and further capture 
the implicit semantics of the text corpus. 
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